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Abstract 

Bayesian shrinkage methods have generated a lot of recent interest as tools for 
high-dimensional regression and model selection. These methods naturally facilitate 
tractable uncertainty quantification and incorporation of prior information. This benefit 
has led to extensive use of the Bayesian shrinkage methods across diverse applications. 

A common feature of these models, including the Bayesian lasso, global-local shrinkage 
priors, spike-and-slab priors is that the corresponding priors on the regression coefficients 
can be expressed as scale mixture of normals. While the three-step Gibbs sampler used 
to sample from the often intractable associated posterior density has been shown to 
be geometrically ergodic for several of these models (Khare and Hobert, 2013; Pal and 
Kharc, 2014), it has been demonstrated recently that convergence of this sampler can 
still be quite slow in modern high-dimensional settings despite this apparent theoretical 
safeguard. In order to address this challenge, we propose a new method to draw from 
the same posterior via a tractable two-step blocked Gibbs sampler. We demonstrate that 
our proposed two-step blocked sampler exhibits vastly superior convergence behavior 
compared to the original three- step sampler in high-dimensional regimes on both 
real and simulated data. We also provide a detailed theoretical underpinning to the 
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new method in the context of the Bayesian lasso. First, we prove that the proposed 
two-step sampler is geometrically ergodic, and derive explicit upper bounds for the 
(geometric) rate of convergence. Furthermore, we demonstrate theoretically that while 
the original Bayesian lasso chain is not Hilbert-Schmidt, the proposed chain is trace 
class (and hence Hilbert-Schmidt). The trace class property has useful theoretical and 
practical implications. It implies that the corresponding Markov operator is compact, 
and its (countably many) eigenvalues are summable. It also facilitates a rigorous 
comparison of the two-step blocked chain with “sandwich” algorithms which aim to 
improve performance of the two-step chain by inserting an inexpensive extra step. 

Keywords: Bayesian shrinkage; Gibbs sampler; Hilbert-Schmidt operator; Markov chain; 
Scale mixture of normals. 


1 Introduction 

In modern statistics, high-dimensional datasets, where the number of covariates/features is 
more than the number of samples, are very common. Penalized likelihood methods such as 
the lasso (Tibshirani, 1996) and its variants simultaneously inducing shrinkage and sparsity 
in the estimation of regression coefficients. These goals are especially desirable when the 
number of coefficients to be estimated is greater than the sample size. One drawback of 
these penalized method is that it is not immediately obvious how to provide meaningful 
uncertainty quantification for the coefficient estimates. An alternative solution is to pursue a 
Bayesian approach by using shrinkage priors - priors that shrink the coefficients towards zero, 
by either having a discrete point mass component at zero (e.g., spike and slab priors George 
and McCulloch (1993)) or a continuous density with a peak at zero. The uncertainty of the 
resulting estimates can be quantified in a natural way through the usual Bayesian framework 
(e.g., credible intervals). 

In fact, one can interpret the lasso objective function (or some monotone transformation 
thereof) as the posterior under a certain Bayesian model with an independent Laplace prior 
on the coefficients, as was noted immediately by Tibshirani, and developed further by Park 
and Casella Park and Casella (2008) into the Bayesian lasso approach. Following the Bayesian 
lasso, a rich and interesting class of “continuous global-local shrinkage priors” has been 
developed in recent years, (see, for example, Arniagan et ah, 2013b; Carvalho et ah, 2010; 
Griffin and Brown, 2010; and the references therein). The priors are typically scale mixtures 
of normals, and have a peak at zero to promote shrinkage. 

However, for most of these methods, the resulting intractable posterior is not adequately 
tractable to permit the closed-form evaluation of integrals. To address this problem, the 
respective authors have typically proposed a three-block Gibbs sampler based on a hierarchical 
formulation of the prior structure. This structure, which is essentially a type of data 
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augmentation, leads to a tractable three- step Gibbs sampler (one step for (3^ cr^ and the 
augmented parameter block each) that can be used to draw from the desired posterior. These 
posterior samples can then be used to construct credible intervals or other quantities of 
interest. 

Bayesian shrinkage methods (including the Bayesian lasso) have been extensively used 
in applications as diverse as genetics, hnance, ecology, image processing, neuroscience, and 
clinical trials, receiving over 1,000 citations in total (see, for example, Yi and Xu, 2008; 
de los Campos et al., 2009; DeMiguel et al., 2009; Jacquemin and Doll, 2014; Xing et al., 
2012; Mishchenko and Paninski, 2012; Gu et al., 2013; Pong-Wong, 2014; Pong-Wong and 
Woolliams, 2014) example). Nevertheless, the three-step Gibbs sampling schemes needed 
to implement these methods in practice can require considerable computational resources. 
Such computational concerns have primarily limited the application of these approaches to 
problems of only moderately high dimension (e.g., in the tens or hundreds of variables). This 
is often a serious shortcoming since for many modern applications, the number of variables 
p is in the thousands if not more. Thus, methods to analyze or improve the efficiency of 
the Bayesian lasso algorithm, in terms of either computational complexity or convergence 
properties, are an important topic of research for modern high-dimensional settings. 

Khare and Hobert (2013) proved that the three-step Gibbs sampler of Park and Casella 
(which works for arbitrary n and p) is geometrically ergodic for arbitrary values of n and p and 
they provide a quantitative upper bound for the geometric rate constant. Geometric ergodicity 
of the three-step samplers for the Normal-Gamma and Dirichlet-Laplace shrinkage models 
(Griffin and Brown (2010) and Bhattacharya et al. (2015), respectively) was established in 
Pal and Khare (2014). However, it has been demonstrated that the geometric rate constant 
does indeed tend to 1 if p/n —)• oo (Rajaratnam and Sparks, 2015). Thus, despite the 
apparent theoretical safeguard of geometric ergodicity, these three-step Gibbs samplers can 
take arbitrarily long to converge (to within a given total variation distance of the true 
posterior) if p/n is large enough. This fact is problematic since the so-called “small n, large p” 
setting is precisely where the use of the lasso and other regularized regression methods is 
most benehcial and hence most commonly espoused. 

Since the convergence properties of the original three-step Bayesian lasso Gibbs sampler 
deteriorate in high-dimensional regimes, it may be asked whether there exist alternative 
schemes for sampling from the same posterior that maintain a reasonably fast (i.e., small) 
geometric convergence rate when p is large compared to n. Two commonly employed 
approaches to constructing such alternative MGMG schemes within the Gibbs sampling 
context are known as collapsing and blocking. In a collapsed Gibbs sampler, one or more 
parameters are integrated out of the joint posterior, and a Gibbs sampler is constructed on 
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the posterior of the remaining parameters. Although a collapsed Gibbs sampler converges 
at least as fast as its uncollapsed counterpart (Liu et ah, 1994), the resulting distributions 
may not be adequately tractable to permit the implementation of such a scheme in practice. 
In a blocked Gibbs sampler (also called a grouped Gibbs sampler), multiple parameters are 
combined and sampled simultaneously in a single step of the cycle that forms each iteration. 
It is generally understood that blocking usually improves the convergence rate with a careful 
choice of which parameters to group into the same step (Liu et ah, 1994). 

In this paper, we propose a two-step/two-block Gibbs sampler in which the regression 
coefficients (3 and the residual variance are drawn in the same step of the Gibbs sampling 
cycle. This method turns out to be just as tractable as the original three-step/three-block 
Gibbs samplers. Indeed, the distributional forms of our proposed sampler coincide with 
the original chains, differing only in the shape and scale of the inverse gamma distribution 
from which cx^ is drawn. Also, unlike the original three-step/three-block Gibbs chain, the 
two-step/two-block chain is reversible. We demonstrate empirically that in regimes where p 
is much larger than n, the convergence rate of the proposed two-block Gibbs samplers are 
vastly superior to those of the original three-block schemes. 

Next, we undertake a rigorous investigation into the theoretical properties of the proposed 
two-block chain in the context of the Bayesian lasso. We hrst establish geometric ergodicity for 
the blocked chain, and obtain an explicit upper bound for the rate of convergence. Geometric 
ergodicity (along with other mild regularity conditions) implies the existence of a Markov 
chain CLT, which allows users to provide standard errors for Markov chain based estimates 
of posterior quantities. 

We also prove that the (non-self-adjoint) Markov operator associated with the original 
Gibbs sampling Markov chain is not Hilbert-Schmidt (eigenvalues of the absolute value of 
this operator are not square-summable) for all n and p. On the other hand, we prove that 
the (positive, self-adjoint) Markov operator associated with the proposed blocked chain is 
trace-class (eigenvalues are in fact summable, and hence square-summable). Note that all 
the aforementioned eigenvalues are less than 1 in absolute value. These results indicate 
that the proposed Markov chain is more efficient than the original three-step Bayesian lasso 
chain. The blocked Markov chain can also be regarded as a Data Augmentation algorithm. 
Sandwich algorithms, which aim to improve the performance of DA algorithms by inserting 
an inexpensive extra step between the two steps of the DA algorithm, have gained a lot 
of attention in recent years (see Liu and Wu (f999); kleng and van Dyk (1999); Robert 
and Marchev (2008); Khare and Robert (2011) and the references therein). The trace class 
property for the blocked chain, along with results from Khare and Robert (2011), guarantees 
that a large variety of sandwich Markov chains are themselves trace class, and their eigenvalues 


4 


are dominated by the corresponding eigenvalues of the blocked chain (with at least one strict 
domination). Thus, our trace class result provides theoretical support for the use of sandwich 
Markov chains to further improve speed and efficiency for sampling from the desired Bayesian 
lasso posterior distribution. 

It is well-understood (see for example Jones and Robert (2001)) that proving geometric 
ergodicity and trace class/Hilbert Schmidt properties for practical Markov chains snch as 
these can be rather tricky and a matter of arf where each Markov chain requires a genuinely 
nniqne analysis based on its specific strnctnre. Hence, a theoretical stndy of the two-step 
and three-step Markov chains (similar to the study undertaken in this paper for the Bayesian 
lasso) is a big and challenging nndertaking, and is the task of ongoing and future research. 

The remainder of the paper is organized as follows. In Section 2, we revisit the original 
three-step chain. We propose the two-step chain in Section 3. Section 4 provides a numerical 
comparison of the original and the proposed versions for both Bayesian lasso and the spike 
and slab priors (as representatives from the classes of continnons and discrete shrinkage 
priors) on simulated data, while Section 5 does the same for real data. We then focus on the 
theoretical properties of the original and proposed methods in the context of the Bayesian 
lasso model, and nndertake a rigorous investigation in Section 6. 

2 Bayesian shrinkage for regression 

Consider the model 


Y \ f3,a^ r^Nn{tiln + Xf3, a^In), (1) 

where Y G M” is a response vector, X is a known n x p design matrix of standardized 
covariates, /3 G is an nnknown vector of regression coefficients, cr^ > 0 is an nnknown 
residnal variance, and /i G M is an nnknown intercept. As mentioned previously, in modern 
applications, often the nnmber of covariates p is mnch larger than the sample size n. To 
obtain meaningful estimates and identify signihcant covariates in this challenging situation, a 
common and popnlar approach is to shrink the regression coefficients towards zero. In the 
Bayesian paradigm, this can be achieved by using spike and slab priors, which are mixtures of 
a normal density with a spike at zero (low variance) and another normal density which is flat 
near zero (high variance), see Mitchell and Beauchamp (1988); George and McCulloch (1993). 
A popular and more compntationally efficient alternative is to nse “continnons” shrinkage 
prior densities that have a peak at zero, and tails approaching zero at an appropriate rate, see 
Park and Casella (2008); Carvalho et al. (2010); Poison and Scott (2010); Kyung et al. (2010); 
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Griffin and Brown (2010); Arniagan et al. (2013a); Poison et al. (2013) and the references 
therein. 

A common thread that runs through all the above shrinkage priors for high-dimensional 
regression is that they all can be expressed as scale mixtures of normal densities. In particular, 
the prior density for these models can be specihed as 


/3 I (T^,r ~ Np(0p, 0-^1?.^), r~7r(T), (2) 

where 7r(r) is a prior on r = (ti, ..., r^). Suppose further that the prior on and fi is once 
again the improper prior 7r(cr^, /i) = 1/cr^ and that this prior is independent of the prior on r. 
After combining this prior structure with the basic regression model in (1) and integrating 
out p, the remaining full conditional distributions are 


t\ f3,a^,Y r-. I 
a^\f5,T,Y ^ InverseGamma 


[n + p- l)/2, \\Y - Xl3\\l/2 + /3"i:>;'/3/2 


(3) 


(3\a\T,Y r^NJA-^X^Y, 


where At = X^X + D~^ and Dt = Diag{Ti,T2, ■ ■ ■ ,Tp). If it is feasible to draw from 
7i{t I /3, 1^), then the three conditionals above may be used to construct a useful three- 

step Gibbs sampler to draw from the joint posterior 7r(/3, | 1^). The one-step transition 

density k with respect to Lebesgue measure on x IR+ given by 

f3i,r,Y)7i{f3i\T,alY)7i{T \ f3o,al,Y) dr. (4) 


Many commonly used Bayesian methods for high-dimensional regression can be characterized 
in the form of the priors in (2) and the Gibbs sampler in (3). We now present a few such 
examples. 


2.1 Spike-and-Slab Prior 

Now suppose instead that the Tj are assigned independent discrete priors that each assign 
probability Wj to the point kjQ and probability 1 — Wj to the point Q, where Q > 0 is 
small, Kj > 0 is large, and 0 < tCj < 1. This formulation is a slight modihcation of the 
prior proposed by George and McGulloch (1993) to approximate the spike-and-slab prior of 
Mitchell and Beauchamp (1988). Then the conditional posterior distribution of r | (3,a‘^^Y 
is a product of independent discrete distributions that each assign probability Wj to the 
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point KjQ and probability 1 — Wj to the point Q, where 


w,' = < 1 + 




Wi 


exp 


/5| 


S - 1 


-1 


2 a 2 V KjCj 

by straightforward modihcation of the results of George and McCulloch (1993). 


2.2 The Bayesian lasso 

Suppose that the r/s are assigned independent Exponential{X‘^/2) priors. It follows that the 
marginal prior of (3 (given cr^) assigns independent Laplace densities to each component. In 
fact, the posterior mode in this setting is precisely the lasso estimate of Tibshirani (1996). 
Hence, Park and Casella Park and Casella (2008) refer to this approach as the Bayesian lasso. 
In this setting, the conditional posterior distribution of r | l3,a‘^,Y assigns independent 
inverse Gaussian distributions to each I/tj, and hence is easy to sample from. 

Following the three-step Gibbs sampler of Park and Casella, two useful alternative Gibbs 
samplers for sampling from the Bayesian lasso posterior were proposed by Hans (2009). 
However, both samplers, namely the standard and the orthogonal sampler, require a sample 
size n at least as large as the number of variables p, since they require the design matrix to 
have full column rank. The standard sampler can perform poorly when the predictors are 
highly correlated, while the orthogonal sampler becomes very computationally expensive as p 
grows (recall that p still must be less than n). The n > p assumption can thus be a serious 
limitation, as it precisely excludes the high-dimensional settings targeted by the Bayesian 
lasso. 


2.3 Global-local continuous shrinkage priors 

Suppose we assume that tj = pXj for j = 1,2, ,p. Here, p controls global shrinkage 
towards the origin while the local scale parameters Xj allow component wise deviations 
in shrinkage. Hence, these priors are called as global-local shrinkage priors. A variety of 
global-local shrinkage priors have been proposed in the Bayesian literature, see Poison and 
Scott (2010); Bhattacharya et al. (2015) for an exhaustive list. For almost all of these models, 
sampling from the posterior distribution is performed by using a three-block Gibbs sampler, 
with (3 and being two of the blocks, and all the shrinkage parameters (local and global) 
being the third block. As an example, we consider the Dirichlet-Laplace shrinkage prior from 
Bhattacharya et al. (2015). The regression version of this model was considered in Pal and 
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Khare (2014), and is provided as follows. 


(3 I cr^, 'ip,cl),9 N{0, a^DT-) where r,- = 

(T^ ~ Inverse-Gamma(Q!, .^) > 0 fixed 

• • • ,'^p Exp 

(01, 02, ■■•, (t>p) Dir{a, a, ..., a), 6^ ~ Gamma^pa, -J . (5) 

Here r] = {'ij), (p, 6) denotes the collection of all shrinkage parameters, and a is a known 
positive constant. It can be seen that the full conditional distributions of (3 and cr^ are exactly 
the same as in (3). Based on results in Bhattacharya et al. (2015), it is shown in Pal and 
Khare (2014) that samples from the full conditional distribution of r] can be generated by 
making draws from a bunch of appropriate Generalized Inverse Gaussian densities. 



2.4 Student’s t Prior 

First, suppose that the Tj are assigned independent inverse-gamma priors with shape param¬ 
eter i/j/2 and scale parameter Pj/2. This structure corresponds (by integrating out r) to 
specihciation of the prior on /3 | as a product of independent scaled Student’s t priors, 
where the prior on each /3j \ cx^ has Uj degrees of freedom and scale parameter (pjcr^)^/^. 
Such Student’s t priors are a popular choice when a mildly informative prior is desired (see, 
e.g., Gehnan et ah, 2013). In this case, the conditional posterior distribution of r | f3,a‘^,Y 
is a product of independent inverse-gamma distributions, where the distribution of each 
Tj I /3, (T^, Y has shape parameter (z/j -|- l)/2 and scale parameter {jjj + 0|/ct^)/2. 


2.5 Bayesian Elastic Net 

As a hnal example, suppose instead that the Tj are assigned independent continuous priors, 
each with density 


TT 



Al 

2(1 - A2r,)2 


exp 


AlTj 

2(1 - A2r,) 


with respect to Lebesgue measure on the interval (0, A 2 ^), where Ai, A 2 > 0. This structure 
corresponds (by integrating out r) to specihcation of the prior on /3 | cx^ as a product of 
independent priors, where the prior on each jSj \ has density with respect to Lebesgue 






measure that is proportional to 


7r(/?, I a^) oc exp \Pj\- ■ 

Then the conditional posterior distribution of r | /3, cx^, is 


f3,a\Y ~ ind InverseGaussian 

This prior structure and corresponding Gibbs sampler are known as the Bayesian elastic net 
(Li and Lin, 2010; Kyung et ah, 2010). 

3 The fast Bayesian shrinkage algorithm 

While the three-step Gibbs sampler (with transition density k) provides a useful and straight¬ 
forward way to sample from the posterior density, its slow convergence in high-dimensional 
settings with p » n is discussed by Rajaratnam and Sparks (2015). In particular, it is 
demonstrated that the slow convergence problem in these settings arises due to the high 
a posteriori dependence between (3 and It was argued by Lin et al. (1994) that the conver¬ 
gence rate of Gibbs samplers can often be improved by grouping highly dependent parameters 
together into a single block of a blocked Gibbs sampler. Of course, if the conditional densities 
for the blocked Gibbs sampler are not easy to sample from, then any possible gain from 
blocking is likely to be overshadowed by the extra effort in sampling from such densities. 

In an effort to address the slow convergence of the three-step Gibbs sampler in high¬ 
dimensional settings, we prove the following lemma. Its proof may be fonnd in the Supple¬ 
mentary Material. 

Lemma 1. For the Bayesian model in (2), \t,Y has the inverse gamma distribution 
with shape parameter {n — l)/2 and scale parameter Y'^{In — XA~^X'^)Y /2. 

This lemma facilitates the construction of a novel blocked two-step Gibbs sampler which 
can be nsed to generate samples from the joint posterior density of (/3, cr^), and which is as 
tractable as the original three-step Gibbs sampler. This blocked Gibbs sampler alternates 
between drawing (/3, cr^) | r and r | (/3, cr^). In particniar, (/3, cr^) | r may be drawn by hrst 
drawing cr^ | r and then drawing (3 \ a^,T. In other words, the blocked Gibbs sampler may 
be constructed by replacing the draw of | f3,T,Y in (3) with a draw of | r as given 
by Lemma I. In particular, the blocked Gibbs sampler cycles through drawing from the 
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distributions 


t\ f3,a‘^,Y ^ ti{t I f3,a^,Y) 

a‘^ \ t,Y ~ InverseGamma 

(/3,cr^) I t,Y 


( 6 ) 


n 




(3 \ a\T,Y ^NJ A-^X^Y, a^A -^), 


and has one-step transition density k with respect to Lebesgue measure on x M+ given by 


k[{l3o,al),{l3i,al)]= / \ l3o,al,Y)-K^al \ t,Y) 'K{i3i\ al,T,Y) dr. (7) 


Note that the blocked Gibbs sampler is just as tractable as the original sampler since the 
only the parameters of the inverse gamma distribution are modihed. In particular, although 
the blocked Gibbs sampler requires inversion of the matrix At to draw cr^, this inversion 
must be carried out anyway to draw f3, so no real increase in computation is required. 

Note that our Markov chains move on the {f3,a^) space as these are the parameters of 
interest. In other words, we integrate out the augmented parameter r when we compute our 
transition densities k and k in (4) and (7) respectively. If we consider the “lifted versions” of 
our chains where we move on the (/3, r) space (do not integrate out r for the transition 
densities), then the results in Liu et al. (1994) imply that the Markov operator corresponding 
to the two-step chain has a smaller operator norm as compared to the three-step chain. 
However, no comparisons of the spectral radius, which dictates the rate of convergence of 
the non-reversible lifted chains, is available. Furthermore, no theoretical comparison of the 
actual Markov chains (with transition densities k and k) is available. Hence, in subsequent 
sections, we undertake a numerical comparison (on both simulated and real data) as well as a 
theoretical comparison of the proposed two-step sampler and the original three-step sampler. 

Remark. Note that in the case of many global-local shrinkage priors such as in Section 2.3, 
T is a function of the global and local shrinkage parameters. Hence, to sample from r, one 
needs to sample from the conditional distribution of the collection of shrinkage parameters 
(denoted by rj in Section 2.3) given /3 and cr^. However, the distribution of (/3, a^) depends 
on rj only through r, and samples from the joint density of (/3, u^) given r/ can be generated 
exactly as described in (6). 

Remark. We use the term blocked rather than grouped to avoid confusion with existing 
methods such as the group Bayesian lasso (Kyung et ah, 2010), in which the notion of 
grouping is unrelated to the concept considered here. Note also that the original three-step 
sampler could already be considered a blocked Gibbs sampler since the fdj and Tj are not 
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drawn individually. Thus, our use of the term blocked in describing the proposed two-step 
sampler should be understood as meaning that the Gibbs sampling cycle is divided into fewer 
blocks than in the original Gibbs sampler. 

4 Numerical Comparison 

Note that as illustrated in Section 2 our approach is applicable to a wide variety of Bayesian 
high-dimensional regression methods. In this section, we will undertaken a numerical 
comparison between the convergence rates and efficiency of the original and proposed chains 
for two such methods. One is the spike-and-slab approach from 2.1 (representing methods 
with point mass or discrete shrinkage priors), and the second one is Bayesian lasso approach 
from Section 2.2 (representing continuous shrinkage priors). As a proxy for the actual rate of 
convergence of the chain, we consider the autocorrelation in the marginal chain. Note 
that this autocorrelation is exactly equal to the geometric convergence rate in the simplihed 
case of standard Bayesian regression (Rajaratnam and Sparks, 2015) and is a lower bound 
for the true geometric convergence rate in general (Lin et ah, 1994). 

4.1 Numerical comparsion for Bayesian lasso 

Figure 1 plots the autocorrelation for the original three-step and proposed two-step Bayesian 
lasso Gibbs samplers versus p for various values of n. (Similar plots under varying sparsity 
and multicollinearity may be found in the Supplementary Material). The left side of Figure 2 
plots the same autocorrelation versus log(p/n) for a wider variety of values of n and p. It is 
apparent that this autocorrelation for the proposed two-step Bayesian lasso is bounded away 
from 1 for all n and p. The center and right side of Figure 2 show dimensional autocorrelation 
function (DAGF) surface plots (see Rajaratnam and Sparks, 2015) for the original (left) and 
(right) Bayesian lasso Gibbs samplers. (See the Supplementary Material for details of the 
generation of the various numerical quantities that were used in the execution of these chains). 
It is clear that the autocorrelation tends to 1 as p/n —)■ oo for the original Bayesian lasso but 
remains bounded away from 1 for the the proposed two-step Bayesian lasso. 

4.2 Numerical comparison for spike and slab 

We also applied both spike-and-slab Gibbs samplers to simulated data that was constructed 
in a manner identical to the simulated data for the Bayesian lasso in Figure 1. The resulting 
autocorrelations of the chains are shown in Figure 3. As with the Bayesian lasso, it is 
clear that the two-step blocked version of the spike-and-slab sampler exhibits dramatically 
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Figure 1; Autocorrelation of al versus p for the original three-step and proposed two-step 
Bayesian lasso Gibbs samplers with n = 5 (left), n = 10 (center), and n = 20 (right). See the 
Supplementary Material for details of the generation of the numerical quantities used in the 
execution of these chains. 



- 1.0 0.0 1.0 2.0 


log{p/n) 

Figure 2; Autocorrelation of al versus log(p/n) for the original three-step and proposed 
two-step Bayesian lasso Gibbs samplers with various values of n and p (left).Dimensional 
autocorrelation function (DACF) surface plots for the chain relative to n and p for the 
original (center) and two-step (right) Bayesian lasso Gibbs samplers. See the Supplementary 
Material for details of the generation of the numerical quantities used in the execution of 
these chains. 
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improved convergence behavior in terms of n and p as compared to the original three-step 
sampler. 
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Figure 3: Autocorrelation of al versus p for the original two-step and proposed three-step 
spike-and-slab Gibbs samplers with n = 5 (left), n = 10 (center), and n = 20 (right). See the 
Supplementary Material for details of the generation of the numerical quantities used in the 
execution of these chains. 


5 Applications to Real Data 

We now illustrate the benefits of the proposed two-step approach through its application to 
three readily available regression data sets. 

5.1 Gene Expression Data 

5.1.1 Results for Bayesian lasso 

We first evaluate the performance of the original and proposed chains using the Bayesian lasso 
model through their application to the gene expression data of Scheetz et al. (2006), which is 
publicly available as the data set eyedata in the R package flare (Li et al., 2014). This data 
set includes n = 120 observations of a response variable and p = 200 covariates. The columns 
of X were standardized to have mean zero and squared Euclidean norm n. The regularization 
parameter was taken to be A = 0.2185 so that the number of nonzero coefficients in the lasso 
estimate is min{n,p}/2 = 60. Both the oriiginal three-step and proposed two-step Bayesian 
lassos were executed for 10,000 iterations (after discarding an initial “burn-in” period of 1,000 
iterations), which took about 100 seconds on a 2.6 GHz machine for each chain. Further 
numerical details (such as the initialization of the chains) may be found by direct inspection 
of the R code, which we provide in its entirety in the Supplementary Material. 

We now show that the Markov chain of the proposed two-step Bayesian lasso mixes 
substantially better than that of the original Bayesian lasso when applied to this gene 
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expression data set. Specifically, the lag-one autocorrelation of the chain for the two-step 
Bayesian lasso is 0.3885, whereas it is 0.7794 for the original Bayesian lasso. We also computed 
the effective sample size of the cr^ chain using the ef f ectiveSize function of the R package 
coda (Plummer et al., 2006). The effective sample size is 4,160 for the two-step Bayesian lasso 
and 1,240 for the original Bayesian lasso. These effective sample sizes essentially demonstrate 
that the two-step Bayesian lasso needs only about 30% as many Gibbs sampling iterations to 
obtain an MCMC sample of equivalent quality to that of the original Bayesian lasso. Since 
the computational time for a single iteration is essentially the same for both algorithms, the 
two-step Bayesian lasso also requires only about 30 % as much time as the original Bayesian 
lasso to obtain an MCMC sample of equivalent quality. Such considerations are particularly 
important when considering the high-dimensional data sets to which the lasso and Bayesian 
lasso are often applied. 

Remark. The reader may wonder why we have chosen to examine the autocorrelation and 
effective sample size of the chain instead of the (3 chain (i.e., the individual f3j chains). 
The reason is simply that the autocorrelation and effective sample size of the (3 chain can 
fail to accurately convey the mixing behavior of regression-type Gibbs samplers. In short, 
the sampled iterates of certain quadratic forms in f3 can be very highly autocorrelated even 
though the /3j chains themselves have an autocorrelation near zero. Such convergence behavior 
leads to serious inaccuracy when using the MCMC output for uncertainty quantihcation, 
which is the very reason why the Bayesian lasso is employed in the hrst place. In contrast, 
there are strong theoretical connections between the autocorrelation of the chain and the 
true convergence rate for regression-type Gibbs samplers. See Rajaratnam and Sparks (2015) 
for a thorough discussion of these concepts. 

5.1.2 Results for Spike-and-Slab 

Next, we investigate the performance of the two-step Gibbs sampler in (6) relative to the 
original sampler in (3) using the spike-and-slab prior. Both the original and proposed Gibbs 
samplers were executed for 10,000 iterations (after discarding an initial “burn-in” period of 
1,000 iterations), which took about 80 seconds on a 2.6 GHz machine for each chain. Other 
numerical details may be found by direct inspection of the R code, which we provide in its 
entirety in the Supplemental Material. 

As was the case with the two-step Bayesian lasso, the blocked version of the spike-and-slab 
Gibbs sampler exhibits mixing properties that are substantially better than those of the 
corresponding original sampler. Specihcally, the lag-one autocorrelation of the chain for 
the two-step spike-and-slab Gibbs sampler is 0.0187 whereas it is 0.5174 for the original 
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spike-and-slab Gibbs sampler. Moreover, the effective sample size of the chain is 9,372 
for the fast spike-and-slab Gibbs sampler and 2,977 for the original spike-and-slab Gibbs 
sampler. Thus, the blocking approach of the proposed two-step sampler once again yields a 
sampler that is dramatically more efficient than the original three-step sampler. 

5.2 Infrared Spectroscopy Data 

We next evaluate the performance of the original three-step and proposed two-step approaches 
in the context of Bayesian lasso through their application to the infrared spectroscopy data 
of Osborne et al. (1984) and Brown et al. (2001), which is publicly available as the data set 
cookie in the R package ppls (Kramer et al., 2008). We used the first n = 40 observations 
and the first of the response variables (corresponding to fat content). The p = 700 covariates 
of this data set exhibit high multicollinearity, with a median pairwise correlation of 0.96. 
The columns of X were standardized to have mean zero and squared Euclidean norm n. 
The regularization parameter was taken to be A = 0.0504, which yields min{?7,,p}/2 = 20 
nonzero coefficients in the lasso estimate. Both the original and two-step Bayesian lassos were 
executed for 10,000 iterations (after discarding an initial “burn-in” period of 1,000 iterations), 
which took about 25 minutes on a 2.6 GHz machine for each chain. The R code is provided 
in the Supplementary Material. 

two-stepAs with the gene expression data of the previous subsection, the Markov chain 
of the two-step Bayesian lasso mixes much better than that of the original Bayesian lasso 
when applied to this infrared spectroscopy data set. The lag-one autocorrelation of the 
chain is 0.0924 for the fast Bayesian lasso and 0.9560 for the original Bayesian lasso. The 
effective sample size of the cx^ chain is 7,790 for the fast Bayesian lasso and 225 for the 
original Bayesian lasso. two-stepThus, the two-step Bayesian lasso requires only about 3% 
as many Gibbs sampling iterations to achieve the same effective sample size as the original 
Bayesian lasso. Hence, the two-step Bayesian lasso requires only about 3% as much time as 
the original Bayesian lasso to obtain an MGMG sample of equivalent quality. (Also see the 
previous subsection for further comments on the consideration of the chain specifically.) 

5.3 Communities and Crime Data 

For a third application to real data, we consider the communities and crime data of Redmond 
and Baveja (2002), which is publicly available through the UG Irvine Machine Learning 
Repository (Lichman, 2013) as the Communities and Crime Data Set. We chose 50 covariates 
from that data set and constructed p = 1325 covariates by taking the first-order, quadratic, 
and multiplicative interaction terms involving those 50 raw covariates. We again illustrate 
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the performance of the original three-step and proposed two-step chains un the context of 
the Bayesian lasso model in a small-n, large-p setting. We choose n = 10 of the 1994 total 
observations uniformly at random. Each of the 1325 columns (each of length 10) of the 
resulting X matrix were standardized to have mean zero and squared Euclidean norm n. The 
regularization parameter was taken to be A = 1.331, which yields min{?7,,p}/2 = 5 nonzero 
coefficients in the lasso estimate for the particular observations selected. Both the original 
and two-step Bayesian lassos were executed for 10,000 iterations (after discarding an initial 
“burn-in” period of 1,000 iterations), which took about 90 minutes on a 2.6 GHz machine for 
each chain. The R code is provided in the Supplementary Material. 

Once again, the Markov chain of the two-step Bayesian lasso mixes considerably faster 
than that of the original Bayesian lasso. The lag-one autocorrelation of the cx^ chain is 0.0017 
for the two-step Bayesian lasso and 0.9942 for the original Bayesian lasso. The effective 
sample size of the cx^ chain is 10,000 for the two-step Bayesian lasso and 29 for the original 
Bayesian lasso. (Note that the ef f ectiveSize function involves the use of AIC to select the 
order of an autoregressive model, and obtaining an effective sample size exactly equal to the 
number of MCMC iterations simply indicates that AIC has selected a trivial autoregressive 
model of order zero.) Thus, the two-step Bayesian lasso requires only about 0.3% as many 
Gibbs sampling iterations to achieve the same effective sample size as the original Bayesian 
lasso. Hence, the two-step Bayesian lasso requires only about 0.3% as much time as the 
original Bayesian lasso to obtain an MCMC sample of equivalent quality. (Also see the 
previous subsections for further comments on the consideration of the cx^ chain specifically.) 


Data Set 

n 

P 

Autocorrelation 

Effective Size 


Two-step 

Original 

Two-step 

Original 

Ratio 

Gene Expr. (Spike-Slab) 

120 

200 

0.0187 

0.5174 

9,372 

2,977 

3.15 

Gene Expr. (Bayesian lasso) 

120 

200 

0.3885 

0.7794 

4,160 

1,240 

3.4 

Infrared Spectroscopy 

40 

700 

0.0924 

0.9560 

7,790 

225 

34.6 

Communities & Crime 

10 

1,325 

0.0017 

0.9942 

10,000 

29 

344.8 


Table 1: Autocorrelation and effective sample size for the chain of the two-step Bayesian 
lasso and original Bayesian lasso as applied to the three datasets in Section 5. Note that n 
and p denote (respectively) the sample size and number of covariates for each data set. 


5.4 Summary of Applications 

The results for all three data sets in this section are summarized in Table 1. It is clear that 
the two-step chain and the original three-step chain exhibit the same behavior (in regard 
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to n and p) for these real data sets as for the simulated data sets of Section 4. Specifically, 
the original three-step chain mixes very slowly when p is large relative to n, as illustrated 
by the large autocorrelation and small effective sample size of the chain. In contrast, the 
two-step chain enjoys excellent mixing properties in all n and p regimes and becomes better 
as p gets larger (in fact, there is an unexpected “benefit of dimensionality”). 


6 Theoretical Properties 


In this section we undertake an investigation into the theoretical convergence and spectral 
properties of the proposed two-step chain and the original three-step chains. In particular, we 
investigate whether these Markov chains converge at a geometric rate (geometric ergodicity) 
and also the specific behavior of their singular values (trace class/Hilbert Schmidt properties). 
As stated in the introduction, proving geometric ergodicity and trace class/Hilbert Schmidt 
properties for practical Markov chains such as these can be rather tricky and a matter of 
art, where each Markov chain requires a genuinely unique analysis based on its specihc 
structure. Hence, we provide results here for the Bayesian lasso model outlined in Section 
2.2. A theoretical study of all the other two-step and three-step Markov chains is a big and 
challenging undertaking, and is the task of ongoing and future research. 

We hrst proceed to establish geometric ergodicity of the two-step Bayesian lasso Gibbs 
sampler using the method of Rosenthal (1995). This approach provides a quantitative upper 
bound on the geometric convergence rate. To express this result rigorously, we hrst dehne 
some notation. For every m > 1, let (3o) denote the distribution of the mth iterate for 

the two-step Bayesian lasso Gibbs sampler initialized at (/3o, o'o). Let F denote the stationary 
distribution of this Markov chain, i.e., the true joint posterior distribution of (/3, cr^). Finally, 
let c/tv denote total variation distance. Then we have the following result, which we express 
using notation similar to that of Theorem 2 of Jones and Robert (2001). 

Theorem 2. Let 0 < 7 < 1, 6 = ((n -|- 3 )p )(167 + {n + 3))/(647), and d > 2b/ (1 — 7 ). Then 
for any 0 < r < 1 , 


dTv[Fk{o'Q, (3o), F~\ < (1 


eY^ + 



b , Mml \ 
1-7 cri J’ 


for every k > 1, where e = exp^—py/dj, U = 1 + 2{'yd + b), and a = {1 + d)/{l + 2b + 'yd). 

The proof of Theorem 2 uses the lemma below, which is proven in the Supplementary 
Material. 
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Lemma 3. Let = Y^{In - XA^^X^)Y. Then \\A~^X^Y\\l/C^ < ||r||i/4. 

With Lemma 3 in place, we can now prove Theorem 2. 

Proof of Theorem 2. We appeal to Theorem 2 of Jones and Robert (2001) (see also Theo¬ 
rem 12 of Rosenthal, 1995). To apply this result, it is necessary to establish a drift condition 
and an associated minorization condition. Let V{f3,a‘^) = jS/. Let (/3o,cro) denote 
the initial value for the blocked Markov chain, and let (/3i, af) denote the hrst iteration of 
the chain. Then let Eq denote expectation conditional on (/3o,o'q). To establish the drift 
condition, observe that 


Eo[V{f3i,al)] = X^Eo[af^E{f3ff3, \ n,al)] 

= X^Eo[tT{Af^) + \\Af^X'^Y\\lE{af^ \ r) 

= X^ Eo [tr(A;i) + \\Af^X^Y\\l{n - l)/C^ 

< X^ EQ[ti{DX + ||T||i(n - l)/4] = Eo(||t||i) (n -h 3)AV4, 


( 8 ) 


where the inequality is by Lemma 3 and the fact that tr(A.,.^) = tr[(X"^X -|- i)-i] < 

tr[(iA;i)-i] = ii{D t)- Then continuing from (8), we have 

j = l ^ Z 

by the basic properties of the inverse Gaussian distribution. Now note that u < 5vf -|- (4J)“^ 
for all M > 0 and J > 0. Applying this result to (9) for each j with u = A|/lo,j|/c’'o and 
J = 47/(n -I- 3) yields 


E„y(A. 7 )]< 7 E^ + ^ 


i=i 


CTn 


(n 3)p 
16 7 


+ 


{n + 3)p 


= lV{l3o,al) +6, 


establishing the drift condition. 
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To establish the associated minorization condition, observe that k as dehned in (7) is 
given by 




f XP 

IMlMi 

f3-D-^(3- A^lIrlU 

Y {2n)-P- 

(Po 

2a^ 2 


n 


- 1/2 


(a/2)( 


n-l)/2 


exp 


r(^) 

f (A/ 2 vr)^ 


det(A^)i/2 


2crfy {2'Ka‘l)'P/‘^ ^ 

exp 


2{^-i)/2r(^) 




Ay2(/3^ _ A-^X^Y] 


dr 


a 


n+p+l 

1 


X 


exp I 


Y-XMl fX\\f3o\\, f3^D-^f3o\ , 

- ^-2 - ^-2 - exp . -- dr 


2o{ 


2af 


V ^0 


2a2 


( 10 ) 


Now snppose that K(/3o,cro) — Then A^/3 qj/(Tq < d for each j. Let ^ = d^^^X ^Ip. Then 

fX\\(3oh /3o^;'/3o^ ^ f eD-^^\ 

ew[— -> exp(^- —j =xexp(^A||«||, - 

noting that £ = exp(—A||^||i). Combining this inequality with (10) yields 

.lxl/2^(n-l)/2 




{X/27i)p det{A^D-^Y^" _ f A^HtII/, ^^ 

6Xp I-^7- ) X 


2 (n-l )/2 ^^+P+1 


exp 


Y-Xf5,\\i 

- 2 - 1 ^exp(A||^||i-^- I dr 


2al 

= £ ^[(^,1), (/3i,c^i)] 


2cTf 


for ah (/3, (T^) G X M+. Thus, the minorization condition is established. 


□ 


Note that for Theorem 2 to be useful, the bound must actually decrease with k. Thus, it 
is necessary to choose r small enough that U"^/a^~'' < 1. Then for small enough r, the bound 
is dominated by the term (1 — eY^, which is approximately (1 — reY for small r and £. Now 
observe that d > 2b > n^p/32. It follows that 


1 — re = 1 — rexp(—pd^/^) > 1 — rexp(— 77 ,^^ 2 ^ 321 / 2 ^^ 


which tends to 1 exponentially fast as 77 or p tends to inhnity. Thus, although Theorem 2 
establishes that the two-step Bayesian lasso Gibbs sampler is geometrically ergodic and 
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provides a bound for the rate constant, it is not clear how sharp it is. Hence, the bound may 
not be particularly informative in high-dimensional contexts. 

Additional theoretical insight into the differences between the original and blocked 
sampling schemes may be gained through consideration of the spectral properties of the two 
Gibbs samplers. The following theorem provides a key difference between these two chains. 

Theorem 4. The Markov operator corresponding to the original three-step Gibbs transition 
density k is not Hilbert-Schmidt, while the Markov operator corresponding to the blocked 
Gibbs transition density k is trace class (and hence Hilbert-Schmidt) for all possible values of 
p and n. 

The proof of this result is quite technical and long and may be found in the Supplementary 
Material. As discussed in the introduction, the above result implies that the eigenvalues of 
the absolute value of the non-self-adjoint operator associated with the original Gibbs chain 
are not square-summable. In contrast, it also implies that the eigenvalues of the self-adjoint 
and positive operator corresponding to the blocked Gibbs chain are summable (which implies 
square-summability, as these eigenvalues are bounded by 1). Based on this result, we would 
expect the blocked Gibbs chain to be more efficient than the original Gibbs chain. The 
numerical results (for high-dimensional regimes) provided in Sections 4 and 5 confirm this 
assertion. 
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Supplementary Material 


This supplementary material is organized as follows. Section SI contains proofs of Lem¬ 
mas f and 3 and Theorem 4. Section S2 contains additional numerical results. Section S3 
contains details of the numerical results in Sections 4 and 5. Section S4 contains R code 
for executing both the original and two-step Bayesian lasso algorithms. In addition, the R 
code for generating the numerical results of Sections 4, 5, and S2 is provided in its entirety 
in an accompanying hie Supplementary material comprises proofs of Lemmas I and 3 and 
Theorem 4, additional numerical results, and details of the numerical results in Section 4. 


SI Proofs 

Proof of Lemma 1. Integrating out (3 from the joint posterior 7r(/3, r | Y) yields 


7r(cr^,T I Y) = / 7r(/3,(j^,T | Y) djS 


{xy2y 


X 


2 P 


(27r(j^) 

X J exp 
(AV2)P 


exp I 




2 ^ ^ 
i=i 


(27r(j^) 


2a2 
exp 


Y-Xf3 

A^llrlli 




d(3 


- -Y^iln- XAZ^X^)Y 

2^2 V ^ -r ) 


X / exp 


1 


2ct2 

(AV2)P 




(/3 - 


X^Y 


d/3 


(27r(T2)("-+P+^)/2 det(AT)R2 


exp 


A^llrlli Y^{I^-XA-^X^)Y 


2a2 


Then it is clear that 


7r(cT^ I r, Y) cx 


^u2'j{n+p+l)/2 


exp 


Y'^iln - XA-^X'^)Y 
2^^2 


and the result follows immediately. 


□ 
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Proof of Lemma 3. Let X = UCtV'^ be a singular value decomposition of X, where 
ui,..., ujrain{n,p} are the the singular values, and let Tmax = maxi<j<pTj. Then 


Af^X'^Y 


2 

2 


a 


Y^unv^ivn'^nv^ + D-^)~Wn'^u^Y 
~ Y^Y - Y^unv'^ivn^nv^ + D-^yWn'^u'^Y 
^ Y^unv^ivn'^nv^ + T-ljp)~^vn'^u^Y 

- y^y - Y^unv^iVQ'^nv'^ + r-yipyWn'^u^Y 

Y^un{n'^n + T-yipy^n'^u^Y _ y gy^ 
~ Y^uu^Y - Y^umyt-^n + T-yipy^QPU^Y ~ yJhyJ 


where = U'^Y and 


G = Diag 




ujz 


-K + T-mix)^’ ’i^n + ^mL) 


2 

max/ -I 


H = Diag 


--1 


r. 


-1 


y + p 


-1 ’ 
max 


uj'^ + T 

'^n ' ' 1 


-1 


taking a;* = 0 for alH > p if n > p. Then it is clear that 


AyX'^Y 


a 


< max 

l<i<n 


OJt 


yt + Vax 


--1 


< max 

l<i<n 


At ^ 

' max 


-1 )2 
-1 


+ Vix 


r. 


max 


CJ? + T~^ 

' ' max 


< 


Trr 


< 


T 1 


noting for the second inequality that a/{a + 6)^ < 1/(46) for all a > 0 and 6 > 0. □ 

Proof of Theorem f. Let K be the Markov operator associated with the density k. Let K* 
denote the adjoint of K. Note that K is Hilbert-Schmidt if and only if K*K is trace class, 
which happens if and only if I < oo (see Jorgcns, 1982) where 


I ; = 




k ( (/3, cr^), ^/3, j f (y, , (/3, y)jd(3dyd(3da'^ 



^— ^dyd^dyda"^. 
fiyyayY] 


(Sll) 
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By (4), we get that 



/, I i^<'r,Y)f[p I T,p,Y'jf(T I p,p,Y)d- 

'f \ p,T,Y')f(p \T,P,Y')f(T \ p,P,Y) 
i3,f,Y)f[p I f,P,Y'jf{f I p,p,Y)dTdT. 

It follows from (Sll) and Fubini’s theorem that 

I = f f f f f f f(P \P,r,Y)f(p\T,a'\Y)f{T\P,P,Y) 

Jr^ jrp Jrp Jr+ Jrp Jr+ ^ ^ ^ ' 


+ “+ 

cr^ 


P,f,Y)f(^P\f,a\Y)f{f\(3,a\Y) 


fiP.CT^ I 1") 


/(/ 3 , 


d2 I Y 


dTdrdjSda'^dpda'^. 


A straightforward manipulation of conditional densities shows that 

,d^\Y 

= fiP I r,a^,Y)f(^a^ \ P,f,Y^f(f \ P,d^,Yy 



It follows that 


I = 



I f^,'r,Yjfi^f3 I T,a^,Yjf{T \ (3,a^,Y) 

f[f3 I f,a^,Y)f(a‘^ \ ^,f,Y']f(f \ ^Y^da’^da'^djSd^drdf (S12) 


For convenience, we introduce and use the following notation in the subsequent proof. 

3 = A-^X^Y = A^^X^Y 

Ai = (/3 - ^fAX^ - 3) Ai, = (/3 - A)'^A^(/3 - %) 

A = {Y - X^fiY - X/3) + /3'^D;^/3 + 2^ A, = (1" - X^fiY - X/3) + /3^F)ri/3 + 2^. 

(S13) 
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By (3) we get that 

'f I /3,r,y)/(/3 I T,a\Y)f(T I /},a\Y)f(l3 \ f,a\Y) 

I P,r,Y)f(f I 


X 


^n+^exp(_A/(2a' 


= C, 


|A^|2 exp(-Ai/(2cr2)) 


ni 

i=i 


(a2) 


Tj) 2 exp 


^ n+p+2a +i 


aP 


X 





|A^|2 exp(-Ai^/(2(T^)) 
aP 


X 



a) "exp 


^ n-\-p+2a 

AT^ 




2(j2fj a 


(3, 


- 


QYn I -^1+^1 *+^*+/^^-Pt ^^ 

tiXp [ 20-2 

2.11+2^+P+I 


where 


and 


C, = 


(27r)!>2=^r(5i^i^: 


-2 




= i jj( 


Tj) 2 exp 





B' 


Tj) 2 exp 


A^h/ 


j=i ^ ^ J (. j=i 

It follows by (S14) and the form of the Inverse-Gamma density that 

j j /(^-^ I/3,t, 1^)/(^/3 I T,a^l^)/(r I /3 ,ct^1^) 

f{(3 I f,a^,Y)f(a'^ \ P,f,Y)f(f \ P,a^,Y)da^da^ 


X 


|Ai-|2 |A^|2. 


> G4/i(r,r) 


•r n+p^2a 

A 2 


^ n-\-p-\-2ct 

A7^ 


n+p-\-2oc 


A + (3TD^^(3 


Ai + Ai, + A, + f3^D-^f3 


"+P+2^ +p 


>S15) 


where 


G4 = 2^+2p+2ap|^ n + p^+ 2a A 
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Note that 


Ai, + /3^D;1/3 = 

= (/3-^„f(X^X + Dri + D;i)(/3-3„) + 

(3** - - 3*) + 

= h{T, f) + (/3 - + DTi + D;1)(/3 - 3„), (S16) 

where + D-^)-^X^Y and 

/2(t, f) = - A) + 


Hence, by (S16) and the form of the multivariate t-distribution (see Kotz and Nadarajah, 
2004, for example), we obtain that 


Ai + Ai* + A* + f3^D^^f3 


"+P+2" +p 


df3 


Ai + A, + /2(r, f) + {(3- (3..V{X^X + + D-^){(3 - (3. 


p+{n+2p+2a) 

2 


djS 


r( 




A + Dri + D- 


1-1 


r(^^±f^+p) Ai + A, + /2(T,f) 


n+p+2Q: I p * 
2 '”2 


(S17) 


It follows from (S15) and (S17) that 



5-^ I P,'r,Y)f{p I T,a^Aj/(T | f3,a^,Y) x 
/(/3 I f,cr^l^)/(a^ I /3,f,l^)/(f I p,d‘^,Y^da‘^dd^df3 

r( "+y"" )0f^|(A^A + Dri + 


> C'4/i(t,t) 




A + /3^Dri/3 


n+p+2Q: 


r(!i±ip+p) Ai + A. + /j(x,r) 


n+p+2Q: I p 
2 '”2 


(S18) 


Note that 
/3^Dri/3 


/3^Dri/3 


< 


/3^Dri/3 


< max — 


A [Y - X(3Y{Y - X(3) + (3'^DY(3 + 2^ “ (3'^0^(3 ~ ^<3<v\f, 



(S19) 


25 































and 


A. 


(/3-/3f(A.)(/3-/3) 


< 


{Y - X^YiY - X/3) + ^^Dri/3 + 2^ 

(/3 - 3)^(A.)(/3 - 3) + {y - X^Y {y - X3) 

(Y - X^YiY - A/3) + /3^Dri/3 + 2^ 

- X^^{y - X/3) + /3^Z^;i/3 


< 1 + 


Y - XPYiY - X(3) + f3^DYf3 + 2^ 


f3^DYf3 


(S20) 


/ i 

< 1 + max — 

^<j<P\Tj 

From (S18), (S19), (S20), and the fact 

A, = (Y - X^YiY - A/3) + + 2e > + (Y - X^.f {Y - X%) + 2^ 

(/3* minimizes the L.H.S. as a fnnction of /3), it follows that 



I Yr,Yp^(3 I T,aM"j/(T | A 
f{(3 I f,Y,Y)f(Y I P,f,Y)f(f I Y^‘^,Y)dYda'^df3 


> /3(t,t) 


Ai + A* + /2 (t, t) 


n P I ’ 
2 


(S21) 


where 




r( n+2p+2« )^P|^Tj^ + F)ri + D-1| 


r(^^±2p^+p) 


n+p+2Q: 

2 


> X 


2 + max — 1 ^- 
l<t<p\r,• 


1 + max ( 

/2(T,f) 


_ n+p+2Q: 

2 


+ (^" - x(3,Y{y - aa) + 2e 
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Now, note that 


Ai + A* + /2 (t,t) 

= (/3 - Pf{A^)0 -P) + {Y- Xpf{Y - X/3) + + 2^ + /2(t, f) 

= (^ - Pf{A,)0 -P) + 0- P.VA^0 - 3*) + 

(r - X^,f{Y - X^,) + + 2e + /2 (t, f) 

= (^ — + A^)(^ — /3***) + 

- ;9)'^(A^)(^„* - /9) + + 

(r - X3,)^(l- - Xfy + + 2e + /2(r, f) 

= 0 - 3«*)’^(A-r + A^)(/3 - + /4 (t, f), 

where = (A^. + A^)"^2X^X, and 

/4(r, f) = - ;9 )'^(At.)(3*« - 3) + 0*** - - ;9*) + 

{Y - X^,f{Y - X3,) + + 2^ + /2(t, f). 



It follows from (S21) that 

V I P,T,Y)f(p I T,c^^l")/(T I (3,a\Y) 

f{(3 I f,a\Y)f(a^ \ /3,f,l-)/(f | P,a\Y)daVdf3dp 

> /3(T,f) 




/4(t, t) + (/3 - (A^ + A^) (/3 - 

1 


dfB 


1 + (/I - 0 - A 


4/3 


for every (r, r) G x The last integral is infinite based on the fact that the multivariate 
t-distribution with 1 degree of freedom has an inhnite mean (see Kotz and Nadarajah, 2004). 
The hrst part of the result now follows from (S12). 
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We now focus on proving the second part of the result. In the current setting, to prove 
the trace class property for fc, we need to show (see Jorgens, 1982) that 



, cr^), (/3,(j^)) djSda'^ < oo. 


It follows by (7) that 
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Also, (a) follows from the fact that Y^i^I — XA^^X'^)Y < Y^Y. For convenience, let 



















It follows that 
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It follows that 
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Hence, to prove the required result, it is enough to show that 
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Let denote the maximum eigenvalue of X'^X. Then, by the results of Fiedler (1971), it 
follows that 
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i=i 


i=i 


= c^+ ( ^ + ^ + --- + ^V"' + f+ + —+ ■■■ + 


■■■ + 


1 




TiTj 


^TiT2...Tp 

It follows from (S24) that 

< cP [ expf) dr+ c^'-^ 


(S24) 


r^P-2 


6 

'+ \ i=i 

1 



^ +-4 + '" + ^')*=^p(-vE'‘d‘'’' + 






i=i 



+ 




+ ... + ^ + ...)exp(-L^r,)dr + 


i=i 


a/HTTLA 


^2 P 


6 


(S25) 


j=i 


30 


























Now, note that for any m G {1, 2, • • • ,p}, 



The result now follows by (S23), (S25) and (S26). 



< oo. (S26) 
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S2 Additional Numerical Results 

Figure S4 is similar to the left-hand side of Figure 2 under settings of high multicollinearity 
(left), low sparsity (center), and both high multicollinearity and low sparsity (right). (See the 
following section for details.) It is clear that the behavior seen in Figure I is also seen in 
other settings of multicollinearity and sparsity. 



log{p/n) log(p/n) log(p/n) 


Figure S4: Autocorrelation of the al chain versus p/n for various values of n for the original 
and two-step Bayesian lasso Gibbs samplers in settings of high multicollinearity (left), low 
sparsity (center), and both high multicollinearity and low sparsity (right). See the following 
section for details of the generation of the numerical quantities used in the execution of these 
chains. 


S3 Details of Numerical Results 

Each plotted point in Figures 1 and 2 represents the average lag-one autocorrelation over 10 
Gibbs sampling runs of 10,000 iterations each (after discarding an initial “burn-in” period 
of 1,000 iterations). For each of the 10 runs at each n and p setting, the np elements of the 
n X p covariate matrix X were hrst drawn as N{0, 1) random variables with all pairwise 
correlations equal to 1/5. The columns of X were then standardized to have mean zero and 
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squared Euclidean norm n. Also, for each run, the n x 1 response vector Y was generated as 
Y = X/3* + £, where /3* is a p x 1 vector with its hrst [p/S] elements drawn as independent 
t 2 random variables and its remaining p — [p/S] elements set to zero, and where £ is an n x 1 
vector of independent random variables. The initial values were set as /3o = Ip and cTq = 1. 
The regularization parameter A was set to A = 1. In Figure 1, for each n, the values of p were 
2n, 4n, 6n, 8n, and lOn. The left side of Figure 2 includes the same results as in Figure 1 
but also includes p equal to 2n/5, 3n/5, 4n/5, 7n/5, and 14n/5. The DACF surface plots in 
the center and right side of Figure 2 used each combination of n,p G {5,15,25, 35,45}. 

For the examples in Section 5, the starting points of all chains were set as f3o = Ip and 
CTq = 1. For the crime data in Subsection 5.3, the p = 1325 covariates were the 50 hrst-order 
terms, 50 quadratic terms, and 1225 multiplicative interaction terms involving 50 hrst-order 
covariates, which themselves were chosen as follows. First, all potential covariates with 
missing data were excluded, which left 99 potential covariates. Next, any remaining potential 
covariates for which over half of the values were equal to any single value were excluded, 
which left 94 potential covariates. From the remaining potential covariates, we selected the 
50 with the largest absolute correlation with the response vector. These 50 chosen covariates 
were then standardized to have mean zero before the second-order terms were computed. 
From the 1994 total observations, n = 10 were chosen at random. The final columns of the 
design matrix (now with n = 10 rows) were then standardized again to have mean zero and 
squared Euclidean norm n. 

For the numerical results of the spike-and-slab sampler in Subsection 5.1.2, all settings were 
identical to those used for the corresponding results for the Bayesian lasso, with the obvious 
exception of the hyperparameters in the prior. The hyperparameters of the spike-and-slab 
prior were set as Wj = 1/2 and kj = 100 for each j G {1,... ,p}. For the gene expression 
data, we set (j = 0.00002 for each j G {1,... ,p}. For the simulated data, we set Q = 0.01 
for each j G {1,... ,p}. 

The plots in Figure S4 were constructed similarly to the left-hand side of Figure 2, but 
with the following additional modifications. For the left-hand side and the right-hand side 
(but not the center), the pairwise correlation between each element of the design matrix 
was 4/5 (rather than 1/5). For the center and right-hand side (but not the left-hand side), 
the number of nonzero coefficients for the generation of the response vector was taken to 
be [dp/b] (rather than \p/b\). 

The R function set. seed was used to initialize the random number generation. Different 
seeds were used for each data set and each collection of simulations to facilitate reproducibility 
of individual figures and results. The seeds themselves were taken as consecutive three-digit 
blocks of the decimal expansion of tt (i.e., 141, 592, 653, etc.) to make absolutely clear that 
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they were not manipulated to gain any advantage for the proposed methodology. The reader 
can of course verify directly that the results are qualitatively similar for other seeds. 

S4 R Code 

The R code for generating the numerical results of Sections 4, 5, and S2 is provided in 
its entirety in an accompanying hie. This code incorporates functions or data from the R 
packages coda (Plummer et ah, 2006), flare (Li et ah, 2014), lars (Hastie and Efron, 2013), 
ppls (Kramer et ah, 2008), and statmod (Smyth et ah, 2014). We also reproduce below 
the portion of the R code that executes the original and two-step Bayesian lasso algorithms 
themselves. 

# Slightly modified version of rinvgauss function from statmod package 
rinvgauss. <- function(n,mean=l,shape=NULL,dispersion=l){ 

if(!is.null(shape)){dispersion <- 1/shape} 

mu <- rep_len(mean,n) 

phi <- rep_len(dispersion,n) 

r <- rep_len(0,n) 

i <- (mu>0 & phi>0) 

if(!all(i)){ 

r[!i] <- NA 

n <- sum(i) 

} 

phi [i] <- phi[i]*mu[i] 

Y <- rchisq(n,df=l) 

XI <- 1+phi[i]/2*(Y-sqrt(4*Y/phi[i]+Y“2)) 

# Note: The line above should yield all X1>0, but it occasionally doesn’t due to 

# numerical precision issues. The line below detects this and recomputes 

# the relevant elements of XI using a 2nd-order Taylor expauision of the 

# sqrt function, which is a good approximation whenever the problem occurs, 
if (any(Xl<=0)){Xl[Xl<=0] <- (1/(Y=t=phi [i] ) ) [X1<=0] } 

firstroot <- as.logical(rbinom(n,size=lL,prob=l/(l+Xl))) 
r [i][firstroot] <- XI[firstroot] 
r [i][!firstroot] <- 1/Xl[!firstroot] 
mu*r 
} 

# Draw from inverse-Gaussian distribution while avoiding potential numerical problems 
rinvgaussian <- function(n,m,1){ 

m. <- m/sqrt(m*l) 

1. <- l/sqrt(m*l) 

sqrt(m*l)^rinvgauss.(n,m.,1.) 

} 


33 


# Gibbs iteration functions for both Bayesiaui lassos 

# Note: The versions have separate functions, as opposed to being different 

# options of the sernie function, since the latter would require checking any such 

# options every time the function is called, i.e., in every MCMC iteration. 

# Note: The values of XTY, n, and p can obviously be calculated from X and Y, but they 

# are included as inputs to avoid recalculating them every time the function is 

# called, i.e., in every MCMC iteration. 

iter.bl.original <- function(beta,sigma2,X,Y,XTY,n,p,lambda){ 
d.tau.inv <- rinvgaussian(p,sqrt(lambda~2*sigma2/beta"2),lEmibda"2) 

A.chol <- chol(t (X)yo*7,X+diag(d.tau. inv)) 

beta.tilde <- backsolve(A.chol,backsolve(t(A.chol),XTY,upper.tri=F)) 

Z <- rnorm(p) 

beta.new <- beta.tilde+sqrt(sigma2)*backsolve(A.chol,Z) 
sigma2.new <- (sum( (Y-drop(xy,*yobeta.new) ) ~2) + 
sum(beta.new~2*d.tau.inv))/rchisqCl,n+p-l) 
return(list(beta=beta.new,sigma2=sigma2.new)) 

} 

iter.bl.fast <- function(beta,sigma2,X,Y,XTY,n,p,lambda){ 
d.tau.inv <- rinvgaussian(p,sqrt(lambda~2*sigma2/beta"2),lEmibda"2) 

A.chol <- chol (t (X)yo*y,X+diag(d. tau. inv) ) 

beta.tilde <- backsolve(A.chol,backsolve(t(A.chol),XTY,upper.tri=F)) 
sigma2.new <- (sum(Y~2)-sum(XTY*beta.tilde))/rchisq(l,n-l) 

Z <- rnorm(p) 

beta.new <- beta.tilde+sqrt(sigma2.new)*backsolve(A.chol,Z) 
return(list(beta=beta.new,sigma2=sigma2.new)) 

} 

# Run original and two-step Bayesian lassos 

run.bl <- function(X,Y,lamibda,K,M,outfile.stem,fast=F,keep.beta=T,write.each=F){ 

XTY <- drop(t(X)y.*y.Y) 
n <- dim(X)[1] 
p <- dim(X)[2] 

iter.bl <- get(paste("iter.bl.",ifelse(fast,"fast","original"),sep="")) 
chaindigits <- max(l,ceiling(log(M,10))) # digits needed for chain label strings 

for(chain in 0:(M-1)){ 
beta <- rep(l,p) 
sigma2 <- 1 

chaintext <- substring(format(chain/(lO'chaindigits),nsmall=chaindigits),3) 
outfile.beta <- paste(outfile.stem,,chaintext,"-b.txt",sep="") 
outfile.sigma2 <- paste(outfile.stem,,chaintext,"-s.txt",sep="") 
if(write.each){ 
for(k in 1:K){ 
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iter.result <- iter.bl(beta,sigma2,X,Y,XTY,n,p,lambda) 
beta <- iter.result$beta 
sigma2 <- iter.result$sigma2 
if(keep.beta){ 

beta.row <- matrix(beta,nrow=l) 

write.table(beta.row,outfile.beta,append=T,row.naines=F,col.names=F) 

} 

write.table(sigma2,outfile.sigma2,append=T,row.naines=F,col.names=F) 

} 

}else{ 

beta.chain <- matrix(NA,nrow=K,ncol=p) 
sigma2.chain <- rep(NA,K) 
for(k in 1:K){ 

iter.result <- iter.bl(beta,sigma2,X,Y,XTY,n,p,lambda) 
beta <- iter.result$beta 
sigma2 <- iter.result$sigma2 
beta.chain[k,] <- beta 
sigma2.chain[k] <- sigma2 
} 

if(keep.beta){ 

write.table(beta.chain,outfile.beta,row.names=F,col.names=F) 

> 

write.table(sigma2.chain,outfile.sigma2,row.names=F,col.ncmies=F) 

> 

typetext <- ifelse(fast,"Fast","Drig") 

print(paste(typetext,"chain",chain+1,"of",M,"complete at",dateO)) 
flush. consoleO 
} 

} 
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